ABSTRACT


Minimizing the memory requirements for both program and data is a critical objective when
synthesizing software for embedded DSP applications. In prior work, it has been demonstrated that
for graphical DSP programs based on the widely used synchronous dataflow model, an important
class of minimum code size implementations can be viewed as parenthesizations of lexical
orderings of the computational blocks. Such a parenthesization corresponds to the hierarchy of
loops in the software implementation. In this paper, we present a dynamic programming technique
for constructing a parenthesization that minimizes data memory cost from a given lexical ordering
of a synchronous dataflow graph. For graphs that do not contain delays on the edges, this technique
always constructs a parenthesization that has minimum data memory cost from among all
parenthesizations for the given lexical ordering. When delays are present, the technique may make
refinements to the lexical ordering while it is computing the parenthesization, and the data
memory cost of the result is guaranteed to be less than or equal to the data memory cost of all
valid parenthesizations for the initial (input) lexical ordering.

Thus, our dynamic programming technique can be used to post-optimize the output of
any algorithm that schedules SDF graphs into minimum code size loop hierarchies. On several
practical examples, we demonstrate that significant improvement can be gained by such
post-optimization when applied to two scheduling techniques that have been developed earlier.
That is, the result of each scheduling algorithm combined with the post-optimization often
requires significantly less data memory than the result of the scheduling algorithm without
post-optimization. We also present an adaptation of our dynamic programming technique for
post-optimizing an arbitrary (not necessarily minimum code size) schedule to optimally reduce
the code size.

1. Background


This paper develops a dynamic programming technique for reducing memory requirements
when synthesizing software from graphical DSP programs that are based on the synchronous
dataflow (SDF) model [10]. Numerous DSP design environments, including a number of
commercial tools, support SDF or closely related models [9, 14, 13, 15, 16]. In SDF¹, a program
is represented by a directed graph in which each vertex (actor) represents a computation, an edge
specifies a FIFO communication channel, and each actor produces (consumes) a fixed number of
data values (tokens) onto (from) each output (input) edge per invocation.

Fig. 1 shows an SDF graph. Each edge is annotated with the number of tokens produced
(consumed) by its source (sink) actor, and the “D” on the edge from A to B specifies a unit delay.
Given an SDF edge e, we denote the source and sink actors by src(e) and snk(e), and we
denote the delay by delay(e). Each unit of delay is implemented as an initial token on the edge.
Also, prod(e) and cons(e) denote the number of tokens produced by src(e) and consumed
by snk(e), respectively, per invocation.

The first step in compiling an SDF graph is to construct a valid schedule, which is a
sequence of actor invocations that invokes each actor at least once, does not deadlock, and
produces no net change in the number of tokens queued on each edge. For each actor in the valid
schedule, a corresponding code block, obtained from a library of predefined actors, is instantiated.
The resulting sequence of code blocks is encapsulated within an infinite loop to generate a
software implementation of the SDF graph [8].

Figure 1. A simple SDF graph.

1. This should not be confused with the use of “synchronous” in synchronous languages [1],

In [10], efficient algorithms are presented to determine whether or not a given SDF graph
has a valid schedule, and to determine the minimum number of times that each actor must be fired
in a valid schedule. We represent these minimum numbers of firings by a vector $q_G$, indexed by
the actors in G. We refer to $q_G$ as the repetitions vector of G. Given an edge e in G, we define

$$\mathit{TNSE}(e) = q_G(\mathit{src}(e)) \times \mathit{prod}(e) = q_G(\mathit{snk}(e)) \times \mathit{cons}(e). \tag{1}$$

Thus, TNSE(e) is the total number of tokens produced onto (consumed from) e in one period of
a valid schedule. The equality of the two products in (1) follows from the definition of the
repetitions vector. For Fig. 1, q(A, B, C) = (3, 6, 2), and TNSE((A, B)) = TNSE((B, C)) = 6.
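The relationship in (1) can be illustrated with a small sketch. The code below is not the algorithm of [10]; it is a minimal illustration that assumes a connected graph and propagates rational firing rates along the edges. The function name `repetitions_vector` and the edge tuple layout are hypothetical, and the edge rates used for Fig. 1 are inferred from the values q(A, B, C) = (3, 6, 2) and TNSE = 6 quoted above.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions_vector(actors, edges):
    """Return the smallest positive integer firing counts that balance every edge.

    edges is a list of (src, snk, prod, cons) tuples; the graph is assumed
    to be connected (an assumption of this sketch).
    """
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, snk, prod, cons in edges:
            # Balance condition on an edge: rate[src] * prod == rate[snk] * cons.
            if src == a and snk not in rate:
                rate[snk] = rate[a] * prod / cons
                pending.append(snk)
            elif snk == a and src not in rate:
                rate[src] = rate[a] * cons / prod
                pending.append(src)
    # Scale the rational rates to the smallest integer vector.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    ints = {a: int(r * scale) for a, r in rate.items()}
    g = reduce(gcd, ints.values())
    q = {a: v // g for a, v in ints.items()}
    for src, snk, prod, cons in edges:
        if q[src] * prod != q[snk] * cons:
            raise ValueError("inconsistent sample rates: no valid schedule")
    return q


def tnse(q, edge):
    """Total number of samples exchanged on an edge, as in equation (1)."""
    src, _snk, prod, _cons = edge
    return q[src] * prod


# Edge rates for Fig. 1, inferred from the values quoted in the text.
FIG1_EDGES = [("A", "B", 2, 1), ("B", "C", 1, 3)]
q = repetitions_vector(["A", "B", "C"], FIG1_EDGES)
print(q, [tnse(q, e) for e in FIG1_EDGES])
```

Delays do not enter this computation, since they affect only whether a schedule deadlocks, not the balance of production and consumption.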

One valid schedule for Fig. 1 is B(2AB)CA(3B)C. Here, a parenthesized term
$(n\,S_1 S_2 \ldots S_k)$ specifies n successive firings of the “subschedule” $S_1 S_2 \ldots S_k$,
and we may translate such a term into a loop in the target code. This notation naturally
accommodates the representation of nested loops. We refer to each parenthesized term
$(n\,S_1 S_2 \ldots S_k)$ as a schedule loop. A looped schedule is a finite sequence
$V_1 V_2 \ldots V_k$, where each $V_i$ is either an actor or a schedule loop.

A more compact valid schedule for Fig. 1 is (3A)(2(3B)C). We call this schedule a
single appearance schedule since it contains only one lexical appearance of each actor. To a
good first approximation, any valid single appearance schedule gives the minimum code size cost
for in-line code generation. This approximation neglects second-order effects such as loop
overhead and the efficiency of data transfers between actors. Systematic synthesis of single
appearance schedules is described in [3].

Given an SDF graph G, a valid schedule S, and an edge e in G, max_tokens(e, S)
denotes the maximum number of tokens that are queued on e during an execution of S. For
example, if for Fig. 1, $S_1$ = (3A)(6B)(2C) and $S_2$ = (3A(2B))(2C), then
max_tokens((A, B), $S_1$) = 7 and max_tokens((A, B), $S_2$) = 3. We define the buffer
memory requirement of a schedule S by
$\mathit{buffer\_memory}(S) = \sum_{e} \mathit{max\_tokens}(e, S)$, where the
summation is over all edges in G. Thus, buffer_memory($S_1$) = 7 + 6 = 13, and
buffer_memory($S_2$) = 3 + 6 = 9.
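The two buffer memory figures above can be checked by direct simulation. The following is a minimal sketch, not part of the original tool flow: `expand` and `buffer_memory` are hypothetical helper names, actors are assumed to have single-character names, schedules are assumed to be written without spaces, and the edge rates for Fig. 1 are inferred from the repetitions vector and TNSE values given earlier.

```python
def expand(schedule):
    """Expand a looped schedule such as '(3A(2B))(2C)' into a list of firings."""
    pos = 0

    def parse_body():
        nonlocal pos
        firings = []
        while pos < len(schedule) and schedule[pos] != ")":
            if schedule[pos] == "(":
                pos += 1
                start = pos
                while schedule[pos].isdigit():
                    pos += 1
                count = int(schedule[start:pos])  # loop iteration count
                body = parse_body()
                pos += 1  # skip the closing parenthesis
                firings.extend(body * count)
            else:
                firings.append(schedule[pos])  # a single actor firing
                pos += 1
        return firings

    return parse_body()


def buffer_memory(schedule, edges):
    """Simulate one schedule period and return (total, per-edge peak token counts).

    edges maps (src, snk) to (prod, cons, delay).
    """
    tokens = {e: delay for e, (_prod, _cons, delay) in edges.items()}
    peak = dict(tokens)
    for actor in expand(schedule):
        for e, (_prod, cons, _delay) in edges.items():
            if e[1] == actor:
                tokens[e] -= cons
                if tokens[e] < 0:
                    raise ValueError(f"schedule deadlocks on edge {e}")
        for e, (prod, _cons, _delay) in edges.items():
            if e[0] == actor:
                tokens[e] += prod
                peak[e] = max(peak[e], tokens[e])
    return sum(peak.values()), peak


# Fig. 1 as inferred from the text: unit delay on (A, B).
FIG1 = {("A", "B"): (2, 1, 1), ("B", "C"): (1, 3, 0)}
print(buffer_memory("(3A)(6B)(2C)", FIG1))
print(buffer_memory("(3A(2B))(2C)", FIG1))
```

The simulation consumes input tokens before producing output tokens on each firing, which matches the usual SDF firing rule.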

The lexical ordering of a single appearance schedule S, denoted lexorder(S), is the
sequence of actors $(A_1, A_2, \ldots, A_n)$ such that each $A_i$ is preceded lexically by
$A_1, A_2, \ldots, A_{i-1}$. Thus, lexorder((2(3B)(5C))(7A)) = (B, C, A). Given an SDF graph,
an order-optimal schedule is a single appearance schedule that has minimum buffer memory
requirement from among the valid single appearance schedules that have a given lexical ordering.

In the model of buffering implied by our “buffer memory requirement” measure, each
buffer is mapped to an independent contiguous block of memory. Although perfectly valid target
programs can be generated without this restriction, it can be shown that having a separate buffer
on each edge is advantageous because it permits full exploitation of the memory savings
attainable from nested loops, and it accommodates delays without complication [11]. Another
advantage of this model is that by favoring the generation of nested loops, the model also favors
schedules that have lower latency than single appearance schedules that are constructed to
optimize various alternative cost measures [11]. Combining the analysis and techniques that we
develop in this paper with methods for sharing storage among multiple buffers is a useful
direction for further study.

In this paper we present a dynamic programming technique for post-processing a single
appearance schedule with the goal of generating a modified single appearance schedule that has a
significantly lower buffer memory requirement. The technique is an extension of the algorithm
developed in [12] for constructing single appearance schedules for chain-structured SDF graphs,
and a basic version of this technique, called Dynamic Programming Post Optimization
(DPPO), that applies to delayless, acyclic graphs was outlined in [12]. In this paper, we give a
detailed specification of DPPO, and we generalize DPPO to handle delays and arbitrary
topologies. We also show that while DPPO always preserves the lexical ordering of the input
schedule, our generalization of DPPO, called GDPPO, may change the lexical ordering (for a
graph that has delays), and thus that it may compute a schedule that has lower buffer memory
requirement than all single appearance schedules that have the same lexical ordering as the input
schedule. Since the introduction of the basic version of DPPO in [12], we have also developed
and implemented two heuristics, called APGAN and RPMC, for efficiently constructing single
appearance schedules that have low buffer memory requirement [2, 11] for acyclic graphs. In this
paper, we present experimental results that demonstrate the ability of GDPPO to significantly
improve the schedules constructed by APGAN and RPMC for practical SDF systems.

A distinguishing characteristic of this work, as compared to other work on memory
optimizations for SDF and related models, is that it focuses on the joint reduction of both code
and data memory requirements for uniprocessor implementations. In contrast, Govindarajan, Gao,
and Desai have developed scheduling algorithms to minimize data memory requirements, without
considering code size, in a parallel processing context [7]. Also, Lauwereins, Wauters, Ade, and
Peperstraete have proposed an extension to SDF called cyclostatic dataflow, which allows an
important class of applications to be described in such a way that significantly less token traffic
is required than the buffer activity that would result from the corresponding pure-SDF
implementations [4]. However, the impact of this model on code size has not been explored in
depth, nor have code size optimizations been developed that exploit the unique features of
cyclostatic dataflow.

In [17], Ritz, Willems, and Meyr present techniques for minimizing the memory
requirements of a class of single appearance schedules that minimize the rate of context switches
between actors. These schedules are called flat schedules since they do not apply any nested
loops. As implied above, the memory requirements associated with flat schedules are often
significantly larger than those of the nested-loop schedules that we discuss in this paper, even if
techniques are applied to share memory among multiple buffers for the flat schedules [11]. For
example, for the mobile satellite receiver example discussed in [17], an optimum single
appearance schedule under the criterion of Ritz, Willems, and Meyr requires 1920 units of
memory. In contrast, we have found that the minimum achievable buffer memory requirement
over all (not necessarily flat) single appearance schedules is 1542 for this example, and this
minimum value is achieved by the APGAN heuristic, which is discussed in [2].

When there is enough memory to accommodate the minimum context-switch schedules of
[17], it is possible that these schedules will result in somewhat higher throughput than the
schedules discussed in this paper, although the difference in context-switch overhead can often
be significantly mitigated by the techniques described in [15]. However, when memory
constraints are severe, the techniques discussed in this paper are preferable because they yield
lower buffer memory requirements.

The basic structure of the dynamic programming techniques developed in this paper and in
[12] was inspired by Godbole’s dynamic programming algorithm for matrix-chain multiplication,
which is presented in [6].




2. Dynamic Programming Post Optimization


Suppose that G is a connected, delayless, acyclic SDF graph, S is a valid single appearance
schedule for G, lexorder(S) = $(A_1, A_2, \ldots, A_n)$, and $S_{OO}$ is an order-optimal
schedule for (G, lexorder(S)). If G contains at least two actors, then it can be shown [2] that
there exists a valid schedule of the form $S_R = (i_L B_L)(i_R B_R)$ such that
buffer_memory($S_R$) = buffer_memory($S_{OO}$) and, for some $p \in \{1, 2, \ldots, (n-1)\}$,
lexorder($B_L$) = $(A_1, A_2, \ldots, A_p)$ and lexorder($B_R$) = $(A_{p+1}, A_{p+2}, \ldots, A_n)$.
Furthermore, from the order-optimality of $S_{OO}$, clearly, $(i_L B_L)$ and $(i_R B_R)$ must
also be order-optimal.

From this observation, we can efficiently compute an order-optimal schedule for G if we
are given an order-optimal schedule $S_{a,b}$ for the subgraph corresponding to each proper
subsequence $A_a, A_{a+1}, \ldots, A_b$ of lexorder(S) such that (1) $(b - a) \le (n - 2)$ and
(2) $a = 1$ or $b = n$. Given these schedules, an order-optimal schedule for G can be derived
from a value of x, $1 \le x < n$, that minimizes

$$\mathit{buffer\_memory}(S_{1,x}) + \mathit{buffer\_memory}(S_{x+1,n}) + \sum_{e \in E_s} \mathit{TNSE}(e),$$

where

$$E_s = \{\, e \mid (\mathit{src}(e) \in \{A_1, A_2, \ldots, A_x\}) \text{ and } (\mathit{snk}(e) \in \{A_{x+1}, A_{x+2}, \ldots, A_n\}) \,\}$$

is the set of edges that “cross the split” if the schedule parenthesization is split between
$A_x$ and $A_{x+1}$.

DPPO is based on repeatedly applying this idea in a bottom-up fashion to the given lexical
ordering lexorder(S). First, all two-actor subsequences $(A_1, A_2), (A_2, A_3), \ldots, (A_{n-1}, A_n)$
are examined and the minimum buffer memory requirements for the edges contained in each
subsequence are recorded. This information is then used to determine an optimal parenthesization
split and the minimum buffer memory requirement for each three-actor subsequence
$(A_i, A_{i+1}, A_{i+2})$; the minimum requirements for the two- and three-actor subsequences are
used to determine the optimal split and minimum buffer memory requirement for each four-actor
subsequence; and so on, until an optimal split is derived for the original n-actor sequence
lexorder(S). An order-optimal schedule can easily be constructed from a recursive, top-down
traversal of the optimal splits [11].

In the r-th iteration of this bottom-up approach, we have available the minimum buffer
memory requirement b[p, q] for each subsequence $(A_p, A_{p+1}, \ldots, A_q)$ that has at most
r members. To compute the minimum buffer memory requirement b[i, j] associated
with an (r + 1)-actor subchain $(A_i, A_{i+1}, \ldots, A_j)$, we determine a value of
$k \in \{i, i+1, \ldots, j-1\}$ that minimizes

$$b[i,k] + b[k+1,j] + c_{i,j}[k], \tag{2}$$

where b[x, x] = 0 for all x and $c_{i,j}[k]$, the memory cost at the split if we split the
subsequence between $A_k$ and $A_{k+1}$, is given by [2]¹

$$c_{i,j}[k] = \frac{\sum_{e \in E_s} \mathit{TNSE}(e)}{\gcd(\{q_G(A_x) \mid i \le x \le j\})}, \tag{3}$$

where

$$E_s = \{\, e \mid (\mathit{src}(e) \in \{A_i, A_{i+1}, \ldots, A_k\}) \text{ and } (\mathit{snk}(e) \in \{A_{k+1}, A_{k+2}, \ldots, A_j\}) \,\} \tag{4}$$

is the set of edges that cross the split.
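The recurrence in (2) through (4) can be sketched in a few lines. The code below is an illustrative sketch for delayless, acyclic graphs only, not the authors' implementation: the function name `dppo`, the edge tuple layout, and the way loop factors are recovered during the top-down traversal (each subsequence is assigned the gcd of its repetitions divided by the gcd of the enclosing subsequence) are assumptions made here. The example graph is a hypothetical delayless variant of Fig. 1.

```python
from functools import reduce
from math import gcd


def dppo(order, q, edges):
    """Return (minimum buffer memory, schedule string) for a lexical ordering.

    order: actor names in lexical order (assumed to be a topological sort).
    q:     dict mapping each actor to its repetitions count.
    edges: list of (src, snk, prod) tuples with no delays.
    """
    n = len(order)
    index = {a: x for x, a in enumerate(order)}
    # Precompute (source position, sink position, TNSE) for each edge.
    tnse = [(index[s], index[t], q[s] * p) for s, t, p in edges]
    b = [[0] * n for _ in range(n)]          # b[i][j]: minimum cost of A_i..A_j
    split = [[None] * n for _ in range(n)]   # best split position k for A_i..A_j

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            g = reduce(gcd, (q[a] for a in order[i:j + 1]))
            best = None
            for k in range(i, j):
                # Edges from {A_i..A_k} to {A_k+1..A_j} cross the split: set (4).
                crossing = sum(t for s, d, t in tnse if i <= s <= k < d <= j)
                cost = b[i][k] + b[k + 1][j] + crossing // g  # expressions (2), (3)
                if best is None or cost < best:
                    best, split[i][j] = cost, k
            b[i][j] = best

    def build(i, j, outer):
        # Top-down traversal of the recorded splits.
        g = reduce(gcd, (q[a] for a in order[i:j + 1]))
        if i == j:
            body = order[i]
        else:
            k = split[i][j]
            body = build(i, k, g) + build(k + 1, j, g)
        factor = g // outer
        return f"({factor}{body})" if factor > 1 else body

    return b[0][n - 1], build(0, n - 1, 1)


# Hypothetical delayless variant of Fig. 1.
print(dppo(["A", "B", "C"], {"A": 3, "B": 6, "C": 2},
           [("A", "B", 2), ("B", "C", 1)]))
```

This sketch recomputes the crossing edges for every split, so it does not reflect the incremental cost updates that an efficient implementation would use.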


3. Extension to Arbitrary Topologies


DPPO can be extended to efficiently handle graphs that are not necessarily delayless,
although a few additional considerations arise. We refer to our extension as Generalized DPPO
(GDPPO). First, if delays are present, then lexorder(S), the lexical ordering of the input
schedule, is not necessarily a topological sort. As a consequence, generally not all
parenthesizations of the input schedule will be valid. For example, suppose that we are given the
valid schedule S = (6A)(5(2C)(3B)) for Fig. 2. Then lexorder(S) = (A, C, B) clearly is not a
topological sort, and it is easily verified that the schedule that corresponds to splitting the
outermost parenthesization between C and B, namely (2(3A)(5C))(15B), is not a valid schedule
since there is not sufficient delay on the edge (B, C) to fire 10 invocations of C before a single
invocation of B.

1. The symbol gcd denotes the greatest common divisor.

Thus, we see that when delays are present, the set $E_s$ defined in (4) no longer generally
gives all of the edges that cross the parenthesization split. We must also examine the set of back
edges

$$E_b = \{\, e \mid (\mathit{snk}(e) \in \{A_i, A_{i+1}, \ldots, A_k\}) \text{ and } (\mathit{src}(e) \in \{A_{k+1}, A_{k+2}, \ldots, A_j\}) \,\}. \tag{5}$$

Each $e \in E_b$ must satisfy

$$\mathit{delay}(e) \ge \frac{\mathit{TNSE}(e)}{\gcd(\{q_G(A_x) \mid i \le x \le j\})}, \tag{6}$$

otherwise the given parenthesization split will give a schedule that is not valid. To take into
account any nonzero delays on members in $E_s$, and the memory cost of each of the back edges,
the cost expression of (2) for the given split gets replaced with

$$b[i,k] + b[k+1,j] + \frac{\sum_{e \in E_s} \mathit{TNSE}(e)}{\gcd(\{q_G(A_x) \mid i \le x \le j\})} + \sum_{e \in E_s} \mathit{delay}(e) + \sum_{e \in E_b} \mathit{delay}(e). \tag{7}$$


Figure 2. An SDF graph used to illustrate GDPPO applied to SDF graphs that have
nonzero delay on one or more edges. Here q(A, B, C) = (6, 15, 10).

Expression (7) gives the cost of splitting the subsequence $(A_i, A_{i+1}, \ldots, A_j)$ between
$A_k$ and $A_{k+1}$ assuming that the subsequence $(A_i, A_{i+1}, \ldots, A_k)$ precedes
$(A_{k+1}, A_{k+2}, \ldots, A_j)$ in the lexical order of the schedule that will be implemented.
However, if (6) is satisfied for all “forward edges” $e \in E_s$, it may be advantageous to
interchange the lexical order of $(A_i, A_{i+1}, \ldots, A_k)$ and $(A_{k+1}, A_{k+2}, \ldots, A_j)$.
Such a reversal will be advantageous whenever the reverse split cost defined by

$$b[i,k] + b[k+1,j] + \frac{\sum_{e \in E_b} \mathit{TNSE}(e)}{\gcd(\{q_G(A_x) \mid i \le x \le j\})} + \sum_{e \in E_b} \mathit{delay}(e) + \sum_{e \in E_s} \mathit{delay}(e) \tag{8}$$

is less than the forward split cost computed from (7), that is, whenever

$$\sum_{e \in E_b} \mathit{TNSE}(e) < \sum_{e \in E_s} \mathit{TNSE}(e). \tag{9}$$
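The forward and reverse split costs in (7) and (8), together with the validity condition (6), can be sketched as follows. This is an illustrative helper rather than the GDPPO implementation: the function name `split_costs`, the edge tuple layout, and the use of infinity to mark an invalid split are assumptions made here, and the two-actor graph in the usage example is hypothetical.

```python
from functools import reduce
from math import gcd, inf


def split_costs(order, q, edges, i, k, j, b_left=0, b_right=0):
    """Return (forward cost, reverse cost) of splitting order[i..j] after position k.

    edges:   list of (src, snk, prod, delay) tuples.
    b_left:  minimum cost already computed for order[i..k].
    b_right: minimum cost already computed for order[k+1..j].
    An invalid split is reported as infinity.
    """
    left, right = set(order[i:k + 1]), set(order[k + 1:j + 1])
    g = reduce(gcd, (q[a] for a in order[i:j + 1]))
    forward_edges = [e for e in edges if e[0] in left and e[1] in right]  # E_s
    back_edges = [e for e in edges if e[0] in right and e[1] in left]     # E_b

    def tnse(e):
        return q[e[0]] * e[2]

    def satisfies_6(es):
        # Condition (6): enough initial tokens to run the consumer side first.
        return all(e[3] >= tnse(e) // g for e in es)

    delays = sum(e[3] for e in forward_edges + back_edges)
    base = b_left + b_right + delays
    forward = base + sum(map(tnse, forward_edges)) // g if satisfies_6(back_edges) else inf
    reverse = base + sum(map(tnse, back_edges)) // g if satisfies_6(forward_edges) else inf
    return forward, reverse


# Hypothetical two-actor cycle in which the reverse split is cheaper.
edges = [("X", "Y", 3, 3), ("Y", "X", 1, 1)]
print(split_costs(["X", "Y"], {"X": 1, "Y": 1}, edges, 0, 0, 1))
```

In this hypothetical example the back edge carries fewer tokens per period than the forward edge, so condition (9) holds and the reverse split has the lower cost.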

The possibility for reverse splits introduces a fundamental difference between GDPPO
and DPPO: if one or more reverse splits are found to be advantageous, then GDPPO does not
preserve the lexical ordering of the original schedule. This is not a problem since in such cases
the result computed by GDPPO will necessarily have a buffer memory requirement that is less
than that of an order-optimal schedule for lexorder(S). On the contrary, it suggests that GDPPO
may be applied multiple times in succession to yield more benefit than a single application. That
is, GDPPO can in general be applied iteratively, where the iterative application terminates when
the schedule produced by GDPPO gives no improvement over the schedule computed in the
previous iteration.

Fig. 3 shows an example where multiple applications of GDPPO are beneficial. Here
q(A, B, C) = (2, 1, 2), and the initial schedule is S = (2A)B(2C), so the initial lexical
ordering is (A, B, C). Upon application of GDPPO, the minimum cost for the subsequence
(A, B) is found to be 2, and the minimum cost for the subsequence (B, C) is found to occur
with a reverse split that has a cost of 2. The minimum cost for the “top-level” subsequence
(A, B, C) is taken as the minimum over the cost if the parenthesization is split between A
and B, which is equal to 0 + 2 + 5 + 0 = 7 from (8), and the minimum cost if the split occurs
between B and C, which is 2 + 0 + 7 + 0 = 9. Thus, the former split is taken, and the result of
applying GDPPO once to S is the schedule $S_1$ = (2A)(1(2C)(1B)), which has a buffer
memory requirement of 7, and a lexical ordering that is different from that of S.

Since lexorder($S_1$) ≠ lexorder(S), it is conceivable that applying GDPPO to $S_1$ can
further reduce the buffer memory requirement. Applying GDPPO to $S_1$, we generate a minimum
cost of 1 for (A, C), which corresponds to another reverse split, and we generate a minimum
cost of 2 for (C, B). Thus, we see that splitting (A, C, B) between A and C gives a cost
of 0 + 2 + 5 + 0 = 7, while splitting between C and B gives a cost of 1 + 0 + 2 + 2 = 5. The
result of GDPPO is thus the schedule $S_2$ = (2CA)B, with a buffer memory requirement of 5. It
is easily verified that application of GDPPO to $S_2$ yields no further improvement, and thus
iterative application of GDPPO terminates after three iterations.

Although the iterative application of GDPPO is conceptually interesting, we have found
that for all of the practical SDF graphs that we have applied it to, termination occurred after only
2 iterations, which means that no further improvement was ever generated by a second
application of GDPPO. This suggests that when compile-time efficiency is a significant issue, it
may be preferable to bypass iterative application of GDPPO, and immediately accept the schedule
produced by the first application.

Figure 3. An illustration of iterative application of GDPPO.

GDPPO can be implemented efficiently by updating forward and reverse costs
incrementally. If we are examining the splits of the subsequence $(A_i, A_{i+1}, \ldots, A_j)$, and
we have computed the forward and reverse split costs $F_k$ and $R_k$ associated with the split
between $A_k$ and $A_{k+1}$, $i \le k < (j - 1)$, then the split costs $F_{k+1}$ and $R_{k+1}$
associated with the split between $A_{k+1}$ and $A_{k+2}$ can easily be derived by examining the
output and input edges of $A_{k+1}$. To ensure that we ignore reverse splits (forward splits) that
fail to satisfy (6) for all $e \in E_s$ ($e \in E_b$), a cost of

$$M = 1 + \sum_{e \in E} (\mathit{TNSE}(e) + \mathit{delay}(e))$$

is added to the reverse (forward) split cost for any input edge (output edge) e of $A_{k+1}$ whose
source (sink) is a member of $(A_{k+2}, A_{k+3}, \ldots, A_j)$, and that does not satisfy (6).
Similarly, for each output (input) edge e of $A_{k+1}$ whose sink (source) is contained in
$(A_i, A_{i+1}, \ldots, A_k)$, and that does not satisfy (6), M is subtracted from $R_{k+1}$
($F_{k+1}$) since such an edge no longer prevents the split from being valid. Choosing M so large
has the effect of “invalidating” any cost $C_M$ that has M added to it (without a corresponding
subtraction) since any minimal valid schedule has a buffer memory requirement less than M, and
thus, any valid split will be chosen over a split that has cost $C_M$.

If forward and reverse costs are updated in this incremental fashion, then GDPPO attains a
time complexity of $O(n_v^3)$¹, where $n_v$ is the number of actors, if we can assume that the
number of input and output edges of each actor is always bounded by some constant a. In the
absence of such a bound, GDPPO has time complexity that is $O(n_e n_v^3)$, where $n_e$ is the
number of edges in the input graph.

1. A function f(x) is O(g(x)) if for sufficiently large x, f(x) is bounded above by a positive
real multiple of g(x).

4. Experimental Results

In [2], two heuristics, called APGAN and RPMC, are described for constructing single
appearance schedules that minimize the buffer memory requirement. APGAN is a bottom-up
clustering technique that has been found to perform well for graphs that have regular topological
structures and sample-rate changes. RPMC is a top-down technique based on a generalized
minimum-cut operation that usually does not perform as well as APGAN on regular graphs, but
often significantly outperforms APGAN on graphs that have irregular structure. Thus, APGAN and
RPMC are complementary: when one of the techniques fails to construct a good schedule, the
other can be expected to find one [2].

Fig. 4 shows a screendump from the Ptolemy prototyping environment [5] that contains an
SDF description of a four-channel, nonuniform filter bank. The single appearance schedule
obtained by APGAN on this system is

(3(3k(3ab)cd(2e)h)(2nugj)(2f(2i)))(4l(2o)q)(4mpr)(3(2(2s)t)(3(2w)xv(3yzA)))

and the resulting buffer memory requirement is 153. After post-processing the schedule with
GDPPO, we obtain the schedule

(3(3(3ab)cd(2e)kh)(2gnufj))(4(3i)l(2o)qmpr(3s))(3(2t)(3(2w)xv(3yzA)))

which has a buffer memory requirement of 137, a 10.5% improvement. Notice that in this
example, GDPPO has changed the lexical ordering, and thus, one or more reverse splits were
found to be beneficial.

The schedule returned by RPMC has a buffer memory requirement of 131, which is lower
than that obtained by the combination of APGAN and GDPPO. However, GDPPO is able to
improve the schedule obtained by RPMC even further. The result computed by GDPPO when
applied to the schedule derived by RPMC is

(3(3(3ab)dc)(2(3e)fgnj)(3kh))(4(3i)mpl(2o)qr(3s))(3(2tu(3w))(3xv(3yzA)))

which has a buffer memory requirement of 128.

Figure 4. Nonuniform filter bank example. The SDF parameters have been
shown wherever they are different from unity.


Table 1 shows the results of applying GDPPO to the schedules generated by APGAN and
RPMC on several practical SDF systems. The columns labeled “% Impr.” show the percentage of
buffer memory reduction obtained by GDPPO. The QMF tree filter banks fall into a class of
graphs for which APGAN is guaranteed to produce optimal results [2], and thus there is no room
for GDPPO to produce improvement when APGAN is applied to these two examples. Overall, we
see that GDPPO produces an improvement in 11 of the 14 heuristic/application combinations.
A “significant” (greater than 5%) improvement is obtained in 9 of the 14 combinations; the mean
improvement over all 14 combinations is 9.9%; and from the CD-DAT and DAT-CD examples,
we see that it is possible to obtain very large reductions in the buffer memory requirement with
GDPPO.


Table 1. Performance of GDPPO on several practical SDF systems.

Application                                            APGAN   APGAN    %       RPMC    RPMC     %
                                                       only    + DPPO   Impr.   only    + DPPO   Impr.
------------------------------------------------------------------------------------------------------
Nonuniform filter bank (1/3, 2/3 splits, 4 channels)   153     137      10.5    131     128      2.34
Nonuniform filter bank (1/3, 2/3 splits, 6 channels)   856     756      11.7    690     589      14.6
QMF tree filter bank (8 channels)                      78      78       0       92      87       5.43
QMF tree filter bank (6 channels)                      166     166      0       218     200      8.26
Two-stage fractional decimation system                 140     119      15.0    133     133      0
CD-DAT sample rate conversion                          396     382      3.54    535     400      25.2
DAT-CD sample rate conversion                          205     182      11.2    275     191      30.5

5. Adaptation to Minimize Code Size for Arbitrary Schedules


We have applied the basic concept behind DPPO to derive an algorithm that computes an
optimally compact looped schedule (minimum code size) for an arbitrary sequence of actor
firings. For example, consider the SDF graph in Fig. 5, and suppose that we are given the valid
firing sequence¹ σ = ABABCBC (this firing sequence minimizes the buffer memory requirement
over all valid schedules). If the code size cost (number of program memory words required) for a
code block for A is greater than the code size cost for C, then the optimally compact looped
schedule for σ is (2AB)CBC, whereas ABA(2BC) is optimal if C has a greater code size cost
than A.

To understand how dynamic programming can be used to compute an optimally compact
loop structure, suppose that $\sigma = A_1 A_2 \ldots A_n$ is the given firing sequence, and let
$\sigma' = A_i A_{i+1} \ldots A_j$ be any subsequence of σ ($1 \le i < j \le n$). If the optimal
loop structures for all (j − i)-length subsequences of σ are available, then we determine the
optimal loop structure for σ' by first computing $C_k = Z_{i,k} + Z_{k+1,j}$ for
$k \in \{i, i+1, \ldots, j-1\}$, where $Z_{x,y}$ denotes the minimum code size cost for the
subsequence $A_x, A_{x+1}, \ldots, A_y$. The value of k that minimizes $C_k$ gives an optimum
point at which to “split” the subsequence if $A_i A_{i+1} \ldots A_j$ are not to be executed
through a single loop.

Figure 5. An SDF graph used to illustrate the problem of finding an optimally compact
loop structure for an arbitrary firing sequence.

1. By a firing sequence, we simply mean a schedule that contains no loops.

To compute the minimum cost attainable for $\sigma'$ if $A_i A_{i+1} \cdots A_j$ are to be executed
through a single loop, we first determine whether or not $\sigma' = A_i A_i \cdots A_i$. If this holds,
then $\sigma'$ can be executed through a single loop $((j - i + 1) A_i)$, and the code size cost is
taken to be the code size cost of $A_i$ plus the code size overhead $C_L$ of a loop. If
$\sigma' \ne A_i A_i \cdots A_i$, we determine whether or not
$\sigma' = A_i A_{i+1} A_i A_{i+1} \cdots A_i A_{i+1}$, and if so, then $\sigma'$ can be implemented as
$(((j - i + 1)/2) A_i A_{i+1})$, and the code size cost is taken to be the sum of $C_L$ and the
costs of $A_i$ and $A_{i+1}$. Next, if $\sigma' \ne A_i A_i \cdots A_i$ and
$\sigma' \ne A_i A_{i+1} A_i A_{i+1} \cdots A_i A_{i+1}$, then we determine whether or not
$\sigma' = A_i A_{i+1} A_{i+2} \cdots A_i A_{i+1} A_{i+2}$. If this holds, then $\sigma'$ can be
implemented as $(((j - i + 1)/3)\, SL(A_i A_{i+1} A_{i+2}))$, where $SL(A_i A_{i+1} A_{i+2})$ is an
optimal loop structure for $A_i A_{i+1} A_{i+2}$. It is easily seen that an optimal loop structure
$L$ for executing $A_i A_{i+1} \cdots A_j$ through a single loop can be determined (if one exists) by
iterating this procedure $\lfloor (j - i + 1)/2 \rfloor$ times, where $\lfloor \cdot \rfloor$ denotes
the floor operator. If one or more loop structures exist for executing $A_i A_{i+1} \cdots A_j$
through a single loop, then the code size cost of the optimal loop structure $L$ is compared to the
minimum value of $C_k$, $i \le k < j$, to determine an optimal loop structure for
$A_i A_{i+1} \cdots A_j$; otherwise the optimal loop structure for $A_i A_{i+1} \cdots A_j$ is taken to
be that corresponding to the minimum value of $C_k$.
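
The procedure described in this section can be sketched in a few lines of Python. This is a
minimal illustration rather than the authors' implementation: the function name, the
representation of a firing sequence as a string of one-letter actor names, the `cost` dictionary,
and the top-down memoized formulation are all assumptions made for the example. Loop bodies of
length 1 and 2 are handled by the same general rule as longer bodies, which gives the same costs
as the special cases in the text.

```python
from functools import lru_cache


def optimal_looped_schedule(firings, cost, loop_overhead):
    """Return (minimum code size, looped schedule) for a loop-free firing sequence.

    firings       -- string of one-letter actor names, e.g. "ABABCBC" (assumed encoding)
    cost          -- dict mapping each actor name to its code block size
    loop_overhead -- code size overhead C_L charged once per loop
    """

    @lru_cache(maxsize=None)
    def best(i, j):
        # Optimal (cost, schedule) for firings[i..j], inclusive, 0-based.
        if i == j:
            return cost[firings[i]], firings[i]
        candidates = []
        # Case 1: no single loop spans i..j, so split after position k (this is C_k).
        for k in range(i, j):
            left_cost, left_sched = best(i, k)
            right_cost, right_sched = best(k + 1, j)
            candidates.append((left_cost + right_cost, left_sched + right_sched))
        # Case 2: a single loop spans i..j; try every body length p up to half the length.
        length = j - i + 1
        for p in range(1, length // 2 + 1):
            periodic = all(firings[t] == firings[t - p] for t in range(i + p, j + 1))
            if length % p == 0 and periodic:
                body_cost, body_sched = best(i, i + p - 1)
                candidates.append(
                    (body_cost + loop_overhead, f"({length // p}{body_sched})")
                )
        # Ties in cost are broken by comparing the schedule strings.
        return min(candidates)

    return best(0, len(firings) - 1)


if __name__ == "__main__":
    # Hypothetical code sizes in which A is larger than C.
    print(optimal_looped_schedule("ABABCBC", {"A": 10, "B": 5, "C": 3}, 2))
```

With the hypothetical costs shown, where the block for A is larger than the block for C, the
sketch selects (2AB)CBC; swapping the costs of A and C makes it select ABA(2BC), matching the
example for Fig. 5.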

The time complexity of this technique for finding an optimal loop structure for the
subsequence $A_i A_{i+1} \cdots A_j$, given optimal looping structures for all subsequences of length
$j - i$ or less, is $O((j - i + 1)^2)$, and the time complexity of the overall algorithm is
$O(n^4)$, where $n$ is the number of firings in the input firing sequence. Thus, the problem of
determining an optimal looping structure is of polynomial complexity when the size of a problem
instance is taken to be the number of firings in the given firing sequence. However, it should be
noted that there is no polynomial function $P(m)$ such that the number of firings in a valid
schedule (defined as the sum of the entries in the repetitions vector) is guaranteed to be less
than $P(m)$ for an arbitrary $m$-actor SDF graph. Thus, unlike DPPO and GDPPO, the algorithm
developed in this section is not of polynomial complexity in the size of the given SDF graph.
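
To see why no polynomial bound exists, consider a chain of $m$ actors in which every edge
produces 2 tokens and consumes 1 token per firing. This graph is a hypothetical illustration, not
an example from the paper, and the function name and edge encoding below are likewise assumed.
Solving the balance equations along the chain gives a repetitions vector whose entries double at
each actor, so the firing count grows as $2^m - 1$:

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm requires Python 3.9+


def chain_repetitions(rates):
    """Minimal repetitions vector of a chain-structured SDF graph.

    rates[i] = (produced, consumed) on the edge from actor i to actor i + 1.
    """
    q = [Fraction(1)]
    for produced, consumed in rates:
        # Balance equation on each edge: q[i] * produced == q[i + 1] * consumed.
        q.append(q[-1] * produced / consumed)
    # Scale by the least common denominator to obtain the smallest integer solution.
    scale = lcm(*(f.denominator for f in q))
    return [int(f * scale) for f in q]


if __name__ == "__main__":
    for m in (4, 8, 16):
        reps = chain_repetitions([(2, 1)] * (m - 1))
        print(m, sum(reps))  # number of firings in one schedule period
```

Because the input to the algorithm of this section is the firing sequence itself, a graph of this
kind yields an input whose length is exponential in the number of actors.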


6.  Conclusion 


This paper has developed a dynamic programming post-optimization, called GDPPO, for
reducing the data memory cost of software implementations of synchronous dataflow graphs that
minimize code size for in-line code generation. We have presented data on several practical
examples showing that GDPPO can produce significant improvements when appended to either of
two existing heuristics, APGAN and RPMC, for constructing minimum code size schedules.
Since these two heuristics are fundamentally different in structure (one is based on top-down
partitioning, and the other on bottom-up clustering), we expect that GDPPO can yield
similar benefits when used to improve the performance of alternative scheduling algorithms. We
have also presented an adaptation of GDPPO to minimize the code size of an arbitrary
synchronous dataflow schedule.